13 research outputs found

    Can we Pretrain a SotA Legal Language Model on a Budget From Scratch?

    Get PDF
    Even though many efficient transformers have been proposed, only a few such models are available for specialized domains. Additionally, since the pretraining process is extremely costly in general – and even more so as the sequence length increases – it is often only within reach of large research labs. One way of making pretraining cheaper is the Replaced Token Detection (RTD) task, which provides more signal during training than MLM, since the loss can be computed over all tokens. In this work, we train Longformer models with the efficient RTD task on long-context legal data to showcase that pretraining efficient LMs is possible using less than 12 GPU days. We evaluate the trained models on challenging summarization tasks requiring the model to summarize complex long texts. We find that both the small and base models outperform their baselines on the in-domain BillSum and out-of-domain PubMed tasks in their respective parameter range. We publish our models as a resource for researchers and practitioners.
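    To illustrate why RTD yields more training signal per sequence than MLM, here is a minimal toy sketch (not the authors' implementation, and with made-up probabilities): MLM's cross-entropy covers only the masked positions (typically ~15%), while RTD's binary original-vs-replaced loss covers every token position.

```python
import math

def mlm_loss(token_probs, masked_positions):
    """Cross-entropy only at masked positions (typically ~15% of tokens)."""
    return -sum(math.log(token_probs[i]) for i in masked_positions) / len(masked_positions)

def rtd_loss(replaced_probs, labels):
    """Binary cross-entropy at EVERY position: was this token replaced?"""
    loss = 0.0
    for p, y in zip(replaced_probs, labels):
        loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return loss / len(labels)

# Toy sequence of 8 tokens: MLM trains on 1 position, RTD on all 8.
probs = [0.9] * 8                      # model's p(correct token) per position
masked = [3]                           # MLM: only position 3 contributes
print(mlm_loss(probs, masked))

rtd_p = [0.1, 0.1, 0.1, 0.8, 0.1, 0.1, 0.1, 0.1]  # p(replaced) per position
labels = [0, 0, 0, 1, 0, 0, 0, 0]                 # position 3 was replaced
print(rtd_loss(rtd_p, labels))
```

    With every position contributing a gradient, the generator/discriminator setup amortizes the cost of each forward pass better, which is the intuition behind the paper's budget claim.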

    An empirical study on cross-x transfer for legal judgment prediction

    Get PDF
    Cross-lingual transfer learning has proven useful in a variety of Natural Language Processing (NLP) tasks, but it is understudied in the context of legal NLP, and not at all in Legal Judgment Prediction (LJP). We explore transfer learning techniques on LJP using the trilingual Swiss-Judgment-Prediction dataset, including cases written in three languages. We find that cross-lingual transfer improves the overall results across languages, especially when we use adapter-based fine-tuning. We further improve the model's performance by augmenting the training dataset with machine-translated versions of the original documents, resulting in a 3x larger training corpus. Furthermore, we perform an analysis exploring the effect of cross-domain and cross-regional transfer, i.e., training a model across domains (legal areas) or regions. We find that in both settings (legal areas, origin regions), models trained across all groups perform better overall, while they also have improved results in the worst-case scenarios. Finally, we report improved results when we ambitiously apply cross-jurisdiction transfer, where we further augment our dataset with Indian legal cases.

    MultiLegalSBD: A Multilingual Legal Sentence Boundary Detection Dataset

    Get PDF
    Sentence Boundary Detection (SBD) is one of the foundational building blocks of Natural Language Processing (NLP), with incorrectly split sentences heavily influencing the output quality of downstream tasks. It is a challenging task for algorithms, especially in the legal domain, given the complex and varied sentence structures used. In this work, we curated a diverse multilingual legal dataset consisting of over 130’000 annotated sentences in 6 languages. Our experimental results indicate that the performance of existing SBD models is subpar on multilingual legal data. We trained and tested monolingual and multilingual models based on CRF, BiLSTM-CRF, and transformers, demonstrating state-of-the-art performance. We also show that our multilingual models outperform all baselines in the zero-shot setting on a Portuguese test set. To encourage further research and development by the community, we have made our dataset, models, and code publicly available.
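    A toy sketch of why legal text defeats naive SBD (illustrative only; the paper's CRF/BiLSTM-CRF/transformer models learn this from annotated data rather than from a hand-written abbreviation list): citation abbreviations like "Art." and "Abs." end in periods but do not end sentences.

```python
import re

# A naive splitter treats every period as a boundary; legal citations like
# "Art. 5 Abs. 2" break it. The abbreviation list here is illustrative only.
LEGAL_ABBREVS = {"Art.", "Abs.", "Ziff.", "lit.", "vgl."}

def naive_split(text):
    """Split after every period followed by whitespace."""
    return [s for s in re.split(r"(?<=\.)\s+", text) if s]

def abbrev_aware_split(text):
    """Only end a sentence at a period-final token that is not a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and token not in LEGAL_ABBREVS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

text = "Gemäss Art. 5 Abs. 2 ist die Klage abzuweisen. Die Kosten trägt der Kläger."
print(len(naive_split(text)))          # over-splits into 4 pieces
print(len(abbrev_aware_split(text)))   # 2 sentences
```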

    Swiss-Judgment-Prediction: A Multilingual Legal Judgment Prediction Benchmark

    Get PDF
    In many jurisdictions, the excessive workload of courts leads to long delays. Suitable predictive AI models can assist legal professionals in their work, and thus enhance and speed up the process. So far, Legal Judgment Prediction (LJP) datasets have been released in English, French, and Chinese. We publicly release a multilingual (German, French, and Italian), diachronic (2000-2020) corpus of 85K cases from the Federal Supreme Court of Switzerland (FSCS). We evaluate state-of-the-art BERT-based methods including two variants of BERT that overcome the BERT input (text) length limitation (up to 512 tokens). Hierarchical BERT has the best performance (approx. 68-70% Macro-F1-Score in German and French). Furthermore, we study how several factors (canton of origin, year of publication, text length, legal area) affect performance. We release both the benchmark dataset and our code to accelerate future research and ensure reproducibility.
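    The hierarchical idea for beating the 512-token limit can be sketched as follows: chunk the document into segments the encoder can handle, encode each segment, then aggregate the segment vectors into one document representation. This is a minimal, hypothetical sketch with a stand-in encoder, not the paper's actual model.

```python
import numpy as np

def chunk(tokens, max_len=512):
    """Split a long token sequence into segments the encoder can handle."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

def encode_segment(segment, dim=8):
    """Stand-in for a BERT encoder returning one vector per segment
    (a real model would produce, e.g., the [CLS] embedding)."""
    rng = np.random.default_rng(len(segment))   # deterministic toy encoding
    return rng.standard_normal(dim)

def hierarchical_encode(tokens, max_len=512):
    """Encode each chunk, then aggregate (here: mean pooling) into one
    document vector that a classification head could consume."""
    segs = chunk(tokens, max_len)
    return np.mean([encode_segment(s) for s in segs], axis=0)

doc = list(range(1300))                 # a "document" of 1300 tokens
print(len(chunk(doc)))                  # 3 segments: 512 + 512 + 276
print(hierarchical_encode(doc).shape)   # (8,)
```

    Real hierarchical variants typically replace the mean pooling with a second, segment-level transformer or an attention layer over the segment vectors.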

    Machine Learning-based Real-Time Indoor Landmark Localization

    Get PDF
    Nowadays, smartphones can collect huge amounts of data from their surroundings with the help of highly accurate sensors. Since the combination of the Received Signal Strengths of surrounding access points and sensor data is assumed to be unique in some locations, it is possible to use this information to accurately predict smartphones' indoor locations. In this work, we apply machine learning methods to derive the correlation between smartphones' locations and the received Wi-Fi signal strength and sensor values. We have developed an Android application that is able to distinguish between rooms on a floor, and special landmarks within the detected room. Our real-world experiment results show that the Voting ensemble predictor outperforms individual machine learning algorithms, achieving the best indoor landmark localization accuracy of 94% in office-like environments. This work provides coarse-grained indoor room recognition and landmark localization within rooms, which can be envisioned as a basis for accurate indoor positioning.
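    The combination step of a voting ensemble can be sketched in a few lines (a hypothetical hard-voting illustration; the classifier names and predictions below are made up, and the paper's ensemble may weight or break ties differently):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine location predictions from several classifiers; ties go to
    the label predicted by the earliest-listed classifier."""
    counts = Counter(predictions)
    best = max(counts.values())
    for p in predictions:               # first-seen wins on ties
        if counts[p] == best:
            return p

# Hypothetical per-classifier predictions for one Wi-Fi/sensor reading:
knn_pred, svm_pred, tree_pred = "office_2", "office_2", "hallway"
print(majority_vote([knn_pred, svm_pred, tree_pred]))   # office_2
```

    The appeal of voting is that individual classifiers tend to make different errors on noisy RSSI readings, so the majority label is more stable than any single prediction.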

    Surface antigens and potential virulence factors from parasites detected by comparative genomics of perfect amino acid repeats

    Get PDF
    BACKGROUND Many parasitic organisms, eukaryotes as well as bacteria, possess surface antigens with amino acid repeats. Making up the interface between host and pathogen, such repetitive proteins may be virulence factors involved in immune evasion or cytoadherence. They find immunological applications in serodiagnostics and vaccine development. Here we use proteins which contain perfect repeats as a basis for comparative genomics between parasitic and free-living organisms. RESULTS We have developed Reptile (http://reptile.unibe.ch), a program for proteome-wide probabilistic description of perfect repeats in proteins. Parasite proteomes exhibited a large variance regarding the proportion of repeat-containing proteins. Interestingly, there was a good correlation between the percentage of highly repetitive proteins and mean protein length in parasite proteomes, but not at all in the proteomes of free-living eukaryotes. Reptile combined with programs for the prediction of transmembrane domains and GPI-anchoring resulted in an effective tool for in silico identification of potential surface antigens and virulence factors from parasites. CONCLUSION Systemic surveys for perfect amino acid repeats allowed basic comparisons between free-living and parasitic organisms that were directly applicable to predict proteins of serological and parasitological importance. An online tool is available at http://genomics.unibe.ch/dora.
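    What "perfect repeat" means here can be made concrete with a toy tandem-repeat scan (illustrative only; Reptile's actual method is probabilistic and proteome-wide, and the sequence below is invented):

```python
def perfect_repeats(seq, min_unit=2, min_copies=3):
    """Find perfect tandem repeats: a unit of >= min_unit residues repeated
    >= min_copies times back-to-back. Returns (start, unit, copies) tuples."""
    hits = []
    n = len(seq)
    for unit_len in range(min_unit, n // min_copies + 1):
        i = 0
        while i + unit_len * min_copies <= n:
            unit = seq[i:i + unit_len]
            copies = 1
            while seq[i + copies * unit_len : i + (copies + 1) * unit_len] == unit:
                copies += 1
            if copies >= min_copies:
                hits.append((i, unit, copies))
                i += copies * unit_len   # skip past the whole repeat region
            else:
                i += 1
    return hits

# An invented sequence with the unit "EENV" repeated 4 times in tandem:
print(perfect_repeats("MKLEENVEENVEENVEENVSTOP"))   # [(3, 'EENV', 4)]
```

    A repeat is "perfect" because every copy is identical; allowing mismatches between copies would require a more tolerant (e.g., alignment-based) scan.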

    MultiLegalPile: A 689GB Multilingual Legal Corpus

    Get PDF
    Large, high-quality datasets are crucial for training Large Language Models (LLMs). However, so far, there are few datasets available for specialized critical domains such as law, and the available ones are often only for the English language. We curate and release MultiLegalPile, a 689GB corpus in 24 languages from 17 jurisdictions. The MultiLegalPile corpus, which includes diverse legal data sources with varying licenses, allows for pretraining NLP models under fair use, with more permissive licenses for the Eurlex Resources and Legal mC4 subsets. We pretrain two RoBERTa models and one Longformer multilingually, and 24 monolingual models on each of the language-specific subsets, and evaluate them on LEXTREME. Additionally, we evaluate the English and multilingual models on LexGLUE. Our multilingual models set a new SotA on LEXTREME and our English models on LexGLUE. We release the dataset, the trained models, and all of the code under the most open possible licenses.

    Tablet-Based Puzzle Game Intervention for Cognitive Function and Well-Being in Healthy Adults: Pilot Feasibility Randomized Controlled Trial.

    Get PDF
    BACKGROUND Promoting cognitive health is key to maintaining cognitive and everyday functions and preventing the risk of cognitive impairment or dementia. Existing scientific evidence shows the benefits of various training modalities on cognition. One way to promote cognitive health is through engagement in cognitive activities (eg, board and video games). OBJECTIVE This study aims to investigate the benefits of dynamic adaptive casual puzzle games on cognitive function and well-being in healthy adults and older people. METHODS A total of 12 adults and older people (female participants: n=6; mean age 58.92, SD 10.28 years; range 46-75 years) were included in this pilot randomized controlled trial. This study used a crossover design with two phases (8 weeks each) and three measurement waves (pretest, midtest, and posttest). The participants were randomly allocated either to the control or experimental group. In the control group, participants read newspapers between the pre- and midtest, then switched to cognitive training with puzzle games. In the experimental group, the interventions were reversed. Baseline measurements (pretest) were collected before the intervention. The interventions were delivered on tablet computers and took place unsupervised at participants' homes. RESULTS The outcome measures included global cognitive function, higher cognitive function, and emotional well-being at 3 time points (pretest, midtest, and posttest) using standardized neuropsychological tests. The participants showed improvements in their visual attention and visuospatial measures after the puzzle game intervention. CONCLUSIONS The study showed that digital games are a feasible way to train cognition in healthy adults and older people. The algorithm-based dynamic adaption allows accommodations for persons with different cognitive levels of skill. The results of the study will guide future prevention efforts and trials in high-risk populations.

    SCALE: Scaling up the Complexity for Advanced Language Model Evaluation

    Get PDF
    Recent strides in Large Language Models (LLMs) have saturated many NLP benchmarks (even professional domain-specific ones), emphasizing the need for novel, more challenging ones to properly assess LLM capabilities. In this paper, we introduce a novel NLP benchmark that poses challenges to current LLMs across four key dimensions: processing long documents (up to 50K tokens), utilizing domain-specific knowledge (embodied in legal texts), multilingual understanding (covering five languages), and multitasking (comprising legal document-to-document Information Retrieval, Court View Generation, Leading Decision Summarization, Citation Extraction, and eight challenging Text Classification tasks). Our benchmark comprises diverse legal NLP datasets from the Swiss legal system, allowing for a comprehensive study of the underlying Non-English, inherently multilingual, federal legal system. Despite recent advances, efficiently processing long documents for intense review/analysis tasks remains an open challenge for language models. Also, comprehensive, domain-specific benchmarks requiring high expertise to develop are rare, as are multilingual benchmarks. This scarcity underscores our contribution's value, considering most public models are trained predominantly on English corpora, while other languages remain understudied, particularly for practical domain-specific NLP tasks. Our benchmark allows for testing and advancing state-of-the-art LLMs. As part of our study, we evaluate several pre-trained multilingual language models on our benchmark to establish strong baselines as a point of reference. Despite the large size of our datasets (tens to hundreds of thousands of examples), existing publicly available models struggle with most tasks, even after in-domain pretraining. We publish all resources (benchmark suite, pre-trained models, code) under a fully permissive open CC BY-SA license.